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ABSTRACT 

Efficient identification and follow-up of astronomical transients is hindered by the need 
for humans to manually select promising candidates from data streams that contain 
many false positives. These artefacts arise in the difference images that are produced 
by most major ground-based time domain surveys with large format CCD cameras. 
This dependence on humans to reject bogus detections is unsustainable for next gen¬ 
eration all-sky surveys and significant effort is now being invested to solve the problem 
computationally. In this paper we explore a simple machine learning approach to real- 
bogus classification by constructing a training set from the image data of ^32000 real 
astrophysical transients and bogus detections from the Pan-STARRSl Medium Deep 
Survey. We derive our feature representation from the pixel intensity values of a 20x20 
pixel stamp around the centre of the candidates. This differs from previous work in 
that it works directly on the pixels rather than catalogued domain knowledge for 
feature design or selection. Three machine learning algorithms are trained (artificial 
neural networks, support vector machines and random forests) and their performances 
are tested on a held-out subset of 25% of the training data. We find the best results 
from the random forest classifier and demonstrate that by accepting a false positive 
rate of 1%, the classifier initially suggests a missed detection rate of around 10%. How¬ 
ever we also find that a combination of bright star variability, nuclear transients and 
uncertainty in human labelling means that our best estimate of the missed detection 
rate is approximately 6%. 

Key words: methods: data analysis, methods: statistical, techniques: image process¬ 
ing, surveys, supernovae: general 


1 INTRODUCTION 

Current transient surveys such as Pan-STARRSl (PS1) 
(Kaiser et al.||2010|), PTF (Rau et al. 2009), LSQ ( |Baltay| 


et al. 2013), SkyMapper (Keller et al. 2007) and CRTS 


(Drake et al. 2009) are efficient discoverers of astrophysi¬ 


cal transients. To make these surveys possible it has become 
necessary to automate every step in the data pipeline includ- 


E-mail: dwright04@qub.ac.uk 


ing data collection, archiving and reduction. A major goal 
for time-domain astrophysics is early detection and rapid 
follow-up to enable complete data sets for transients. Arte¬ 
fact rejection has become the bottle-neck between fast tran¬ 
sient detection and our ability to feed these targets to follow¬ 
up surveys such as PESSTO (Sm artt et al.||2013 ) and PTF 
for early classification. Current artefact rejection typically 
involves deriving some set of parameters from the image 
data of individual detections and thresholding each parame¬ 
ter, only promoting those detections that pass the thresholds 
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to humans for verification. 

The numbers of detections that must be scanned by hu¬ 
mans is still on the order of hundreds of objects each night 
with a high false positive rate. The processing artefacts pro¬ 
duced are a result of many factors such as saturated sources, 
convolution issues and detector defects amongst others, and 
to a large extent are common across all surveys. For the 
next generation of survey we cannot expect humans to re¬ 
main involved in this process of artefact rejection to the 
same extent, where for example we expect on the order of 
10 6 transient detections per night from LSST0 

Significant effort has been devoted to this problem in 
anticipation of these next generation surveys, and to enable 
rapid turn around from detection to classification for current 
surveys. Machine learning techniques have been used to take 
advantage of the large amounts of data gathered by these 
surveys to train a classifier that can distinguish real astro- 
physical transients from artefacts or ‘bogus’ detections. Ex¬ 


amples include Donalek et al. (2008) for the Palomar-Quest 
survey, Romano, Aragon, &; Ding (2006) for SNFactory, and 


Bailey et al. (2007) and du Buisson et al. (2014) for SDSS. 


PTF have demonstrated the ability to efficiently characterise 
detections and initiate rapid follow-up, see | Gal-Yam et al.| 
(2014) for example, where the problem of real-bogus classifi¬ 


cation has been addressed by the work of Bloom et al. (2012) 


and Brink et al. (2013). While these studies do achieve high 


levels of performance, the parameters chosen to represent 
the images are often dependent on the specific implementa¬ 
tion and strategy of the individual surveys. 

In this paper we investigate a simple representation of 
the images by using the pixel intensities in a region around 
a detection in a single difference image. This choice of pa- 
rameterisation is independent of other aspects of the sur¬ 
vey, and is therefore applicable to any survey performing 
difference imaging while also lending itself to implementa¬ 
tion much earlier in the data processing pipeline (potentially 
at the source extraction stage). We begin by outlining the 
real-bogus problem in the context of PS1 in Section [2] fol¬ 
lowed by a description of our training set and image param- 
eterisation in Section [3] In Section [4] we discuss the vari¬ 
ous machine learning algorithms we investigate, outline how 
we select the optimum classifier, and report its performance 
compared with previous work. We continue in Section [5] with 
some further analysis to help understand how we expect the 
classifier to perform on a live data stream. Finally we sum¬ 
marise our results and conclude in Section [6] 


2 PS1 AND THE PROBLEM OF REAL-BOGUS 
CLASSIFICATION 

The Pan-STARRSl system comprises a 1.8 m primary 
mirror (Hodapp et al. 2004) and a held-of-view of 3.3 deg 
imaged by 60 4800x4800 pixel detectors, constructed from 
10 /im pixels subtending 0.258 arcsec (for more details, see 
Magnier et al. (2013)). The PS1 filter system consists of 5 


filters, gpi, rpi, ipi, zpi similar to SDSS griz (York et al. 


2000) with the addition of ypi, which extends redward 
of zpi. The system is described in detail by |Tonry et al.| 


(2012b). The PS1 Science Consortium (PS1SC) operates 
the PS1 telescope performing 2 major surveys. The Medium 
Deep Survey (MDS) ( |Tonry et al. |2012a| is allocated 
25% of observing time for high cadence observations of 10 
fields, each the size of the PS1 field-of-view. The wide-field 
37r survey with 56% observing time aims to observe the 
entire sky north of —30 deg. declination with a total of 20 
exposures per year in all five filters for each pointing. 

In this paper we use images from the MDS. Each 
night 3-5 of the MDS fields are observed. Each epoch is 
composed of eight dithered exposures of 8 X 113 s in gpi 
and rpi, or 8 x 240 s in ipi, zpi and ypi, producing nightly 


stacked images of 904 and 1632 s duration (Tonry et al. 


2012a). Each stack achieves 5cr depths of around 23.3 mag 
in gpi, rpi, ipi, zpi and 21.7 mag in ypi. Images from 
the PS1 system are processed by the Image Processing 
Pipeline (IPP; [Magnier (2006)), on a computer cluster at 
the Maui High Performance Computer Center (MHPCC). 
The images are passed through a series of processing stages 
including device detrending, masking and artefact loca¬ 
tion. Detrending includes bias correction and flat-fielding 
using white light flat-field images from a dome screen, in 
combination with an illumination correction obtained by 
rastering sources across the held-of-view. After deriving 
an initial astrometric solution, the flat-fielded images are 
then warped onto the tangent plane of the sky using a 
flux-conserving algorithm. The plate scale for the warped 
images was originally set at 0.200 arcsec pixel -1 , but has 
since been changed to 0.25 arcsec pixel -1 in what is known 
internally as the V3 tessellation for the MDS fields. Bad 
pixels are masked on the individual images and carried 
through the stacking stage to give the nightly stacks. 

Difference imaging is performed on a daily basis by 
two independent pipelines. IPP takes the nightly stacks 
and creates difference images by subtracting a high-quality 
reference image from the new data. Point spread function 
(PSF) photometry is then performed on the difference 
images to produce catalogues of variables and transient 
candidates (Gezari et al. |2012| |McCrum et al. 2014). The 
Transient Science Server (TSS) developed by the PS ISC 
ingests catalogues of detections of residual flux in the 
difference images and presents potential transients for 
human eyeballing. 

In parallel, an independent set of difference images 
are produced at the Centre for Astrophysics at Harvard 


from the nightly stack images using the PHOTPIPE (Rest 
et al.||2014 2005) software. A custom-built reference stack 
is produced and subtracted from the IPP nightly stack 
to produce an independent difference image. This process 


is described in Gezari et al. (2010), Gezari et al. (2012) 


Chomiuk et al. (2011), Berger et al. (2012), Chornock 


et al. (2013) and Lunnan et al. (2013), and potential 


transients are visually inspected for promotion to the status 
of transient alert. A cross-match between the TSS and the 
PHOTPIPE transient streams is performed and agreement 
between the detection and photometry is now excellent, 
particularly after the application of uniform photometric 


calibration based on the ‘ubercal process (Schlafly et al. 
2012 Magnier et al.||2013 ). 


1 http://www.lsst.org/lsst/ 
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2.1 Artefacts in Difference Imaging 

In this work we only use detections from IPP difference 
imaging and not the independent PHOTPIPE detections. 
In Fig. |T] we show a modular diagram of IPP difference 
imaging process and the sources of the main types of 
artefact. 

The first source of bogus detections are chip defects, 
which take various forms. After detrending the chip data 
are resampled and geometrically warped to fit a unit area 
of sky that the data are projected onto, known as a sky cell. 
Occasionally a transient will lie on a region of the detector 
that when projected onto the sky falls on overlapping sky 
cells. This results in duplicate warp images of the same chip 
data, with the object lying close to one of the skycell edges. 
After warping, sky cell edges, chip defects and saturated 
sources are masked. Masked pixels in individual exposures 
are propagated through the stacking stage. 

A kernel is derived to degrade a high quality template 
image to match the nightly stack. The template is convolved 
with this kernel and subtracted from the nightly stack. 
This series of steps leads to a class of artefacts which we 
refer to as convolution issues. In general these arise from 
the derived kernel not being able to accurately match all 
sources in the template to those in the nightly stack. This 
causes problems with bright sources where the kernel is 
unable to fit the entire PSF of the detection in the nightly 
stack image. These artefacts appear as high signal-to-noise 
(S/N) PSFs but with darker rings appearing in the wings, 
an example is shown in the bottom panel of Fig. [4] We call 
these unclean subtractions. The flux in these detections 
is probably due to a bright stellar variable. Identifying 
variable stars (and AGNs) is quite a different problem to 
detecting transients and we have chosen not to try to tailor 
our algorithms to do both. The efforts in this paper are 
focused on finding transients, although inevitably stellar 
variables from very faint host stars are detected. Hence 
we discard these bright stellar variables that appear in the 
difference images as they are straightforward to identify. 
We find these detections make up ~10% of the bogus 
detections. 

The same convolution issues can lead to poor host 
galaxy subtraction, where an inadequately convolved host 
can be over or under subtracted leaving a pattern of positive 
and negative flux. This makes it difficult to disentangle 
any potential real detections. The third convolution issue 
we highlight in Fig. [l] arises when point-like sources in the 
template image are broader than that of the nightly stack 
resulting in an over subtraction in the wings of the source 
in the difference image. This happens when observing 
conditions have been particularly good and the nightly 
stack is of higher quality than the template image (this 
is not a frequent occurrence). The final artefacts from the 
convolution and subtraction stage are convolution problems 
in the cores of faint galaxies, manifesting themselves as 
faint nuclear transients and appearing as positive flux in 
the difference image. Here the convolution step matches the 
morphology of the faint galaxy in the template and nightly 
stacks well, however the peak flux of the convolved template 
is lower than that in the nightly stack. This results in the 
nucleus of the faint galaxy being under subtracted leaving 
residual flux in the difference image. These artefacts are the 


most difficult to identify by eye but are distinguished by a 
narrower PSF than expected. It is not always clear if the 
flux is due to real variability or an artefact of convolution, 
in any case these targets could not be confidently selected 
as real transients for follow-up. This highlights one of the 
major uncertainties in training the algorithms — secure 
labelling of real and bogus objects, which we return to in 
Sections 15.31 and 15.51 

Another source of artefacts arises during the source 
extraction phase. Flux in the nightly stack from diffraction 
spikes for example that have no equivalent in the template 
image get flagged as potential transients. We refer to these 
as spurious detections in Fig. [I] 

Our approach to date for removing these contaminants 
has been to attempt to derive a set of filters based on image 
statistics derived for each potential transient detection 
by IPP. These filters normally take the form of threshold 
values for some parameters (see Section 2.2). However, the 
parameter space is typically large and the work required to 
manually develop the optimal set of filters is impractical. 
Despite this our current hand-engineered checks allow only 
a small fraction of the bogus images through. This still 
produces on the order of a few hundred bogus objects 
each night passing the cuts and being presented to human 
scanners for verification. This is approaching the limit of 
what can comfortably be processed by humans on a daily 
basis and clearly a solution needs to be found for the next 
generation of survey. 

Over the course of the last ~3 years of the PS1 survey 
we have accumulated a large amount of data associated 
with a few tens of thousands of astronomical sources that 
have either been classified as real objects or artefacts 
using a combination of the cuts detailed in Section [2. 2 1 and 
human scanning. This readily available data lends itself 
to data-mining where we hope to use the historic data to 
improve on the current method of real-bogus classification. 
In Section |2.3| we outline how supervised learning can be 
applied to this archive of PS1 data in order to construct 
a real-bogus classifier that can be applied to the nightly 
stream of new data gathered from PS1 and future surveys. 
First we describe the cuts we perform. 


2.2 Cuts 

Prior to ingesting detections from IPP difference imag¬ 
ing into a MySQL database at Queen’s University Belfast 
(QUB) we perform pre-ingest cuts based on the detection of 
saturated, masked or suspected defective pixels within the 
PSF area. Taking as a typical night 3rd September 2013 
(56548 MJD (Modified Julian Date)), the 7 nightly stacks 
produced 366267 detections (~52000 detections per stack), 
the pre-ingest cuts rejected 94.88% of these detections. 

The ~ 18750 detections passing the pre-ingest cuts are 
associated with transient candidates if there are two or more 
quality detections within the last seven observations of the 
field, including detections in more than one filter, and an 
rms scatter in the positions of ^0.5 arcsec. Each quality 
detection must be of more than 3cr significance and have 
a Gaussian morphology (XYmoments <1.2). These post- 
ingest cuts also include checks for convolution issues, prox¬ 
imity to bright objects and ‘NaN’ values close to the centre 
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Figure 1. Modular diagram of the IPP difference imaging steps and the types of artefacts arising from each stage. 


of bright PSFs. 63% of the detections that passed the pre¬ 
ingest cuts were rejected during the post-ingest cuts. The 
remaining detections were promoted for human screening, 
where 37% of the detections were deemed to be real. These 
real transient candidates are cross-matched with catalogues 
of astronomical sources in the MDS fields. We use our own 
MDS catalogue and also extensive external catalogues (e.g. 
SDSS, GSC, 2MASS, NED, Milliqua^ Veron AGN, X-ray 
catalogues) to make a contextual classification of supernova, 
variable star, active galactic nuclei or nuclear transient. We 
also cross-match with the Minor Planet Centre to reject as¬ 
teroids, though most are removed during the construction 
of the nightly stacks. 


2.3 Supervised Learning for Classification 


In general supervised learning entails learning a model from 
a training set of data for which we provide the desired output 
for each training example. For the purposes of designating 
a detection as a real transient or a processing artefact, the 
desired output for each image is discrete. In such cases the 
problem is a supervised classification task for which there are 
a vast array of machine learning algorithms. In Section [4] we 
discuss the algorithms we try; however all such algorithms 
are trying to learn a model from the training data that will 
allow them to map the input parameterisation of each train¬ 


ing example (see Section 3.2) to the desired output or label , 
while at the same time ensuring the model performs well 
on data not seen during the training phase. For building a 
real-bogus classifier this is an obvious avenue to pursue as 
we have a large sample of historical data for which we have 
labels provided by our current cuts and also through human 
eyeballing. 

In Fig. [2] we show a sample of both real and bogus exam¬ 
ples drawn at random from the training data. Often bogus 
detections show a combination of the factors we describe in 
Section [2.1 1 and typically this affects the centroiding during 
the source detection stage. 


2 http://quasars.org/milliquas.htm 



Figure 2. Example detections randomly selected from the train¬ 
ing data. The 2 columns on the left show examples labelled as 
real and the 2 on the right show those labelled as bogus. Bogus 
detections a-c and f show signs of over subtraction, with a and 
f also showing masking, d is a faint galaxy convolution problem. 
Detections e and g are saturated sources that have been masked. 
Finally, h is an example of an unclean subtraction of a bright 
star. 


3 TRAINING SET AND FEATURE 
REPRESENTATION 

As discussed in the previous section we must provide a la¬ 
belled training set from which the classifier can learn to 
recognise the characteristics that can identify detections as 
being members of one of the classes: real or bogus. In order 
to learn a model that will generalise well to detections in new 
observations, it is important that detections in the training 
set are representative of all detections we expect to see. In 
practice this is easiest to achieve by providing the learn¬ 
ing algorithm with the largest possible training set, indeed 


Brink et al. ( 2013) attribute much of their improvement in 
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performance over Bloom et al. (2012) to using a training set 
with two orders of magnitude more training examples. In 
the remainder of this section we describe the compilation of 
the training set, starting with a description of our training 
example selection process and labelling. 


3.1 Training Set 


Over the past 3 years ~1 million potential transients have 
been catalogued in the MDS by the TSS. Approximately 
8000 of these objects have been selected by humans as real 
transients and promoted as potential targets for spectro¬ 
scopic follow-up. As of the end of the survey in May 2014, 
515 transients had spectroscopic classifications. 

The aggregate catalogue information for all objects ex¬ 
tracted by IPP and which pass the pre-ingest cuts described 
in Section [272] are stored in a database at QUB. Individual 
detections are associated with an object if they are spatially 
coincident within 0.5 arcsec. This information is presented 
to humans in the form of webpage^ The webpages show all 
the photometric points produced by IPP in a multi-colour 
lightcurve. The number of photometric detections typically 
ranges from a few to a few dozen depending on the magni¬ 
tude and timescale of the transient objects (s ee|Rest et al. 
(2014), McCrum et al. (2014), Chomiuk et al. (2011) for ex¬ 


amples of lightcurves). These webpages also present a sub¬ 
set of the image postage-stamps of the detections associated 
with an object (target image, reference image and difference 
image). This subset contains the first detections of the ob¬ 
ject of which there are always at least 2 (see Section 2.2) and 
up to 5 subsequent detections. Each object is then eyeballed 
by a human, those that appear to be real transients are 
promoted as potential targets for scientific follow-up, while 
artefacts are discarded. 

Our training examples are drawn from the subset of 
detections we choose to present on the human digestible 
webpages for each object, as detailed above (typically 2-3 
but less than 7). The majority of real examples were taken 
from detections of promoted objects with no spectroscopic 
classification. There is no guarantee that all detections of a 
promoted object are necessarily a result of good image sub¬ 
tractions. This prohibits simply assigning a label of real to 
all individual detections associated with a promoted target. 
In order to to ensure that we have a secure, reliable and clean 
set of real detections for training, we inspected and individu¬ 
ally labelled 4352 detections (from 1919 different transients) 
as real, discarding any artefacts from the training set. We 
augmented this sample of real detections with data from 53 
spectroscopically confirmed supernovae (from Dec. 2012 to 
Jan. 2014) for which we used the complete set of detections 
(~31 detections per object on average). These were again 
manually checked to remove bogus detections. We held out 
the first detections of all 53 SNe, which we use for testing in 
Section |5.7| an d all detections of PSl-13avb, which we use 
in Section [5.6| This leaves an additional 1603 real training 
examples bringing the total to 5955 real detections. 

Over the course of the survey approximately 800000 ob¬ 
jects have been discarded as artefacts providing on the or- 


3 Similar webpages are made public for the PS1 3Pi Survey at: 
http: / / star.pst.qub.ac.uk/pslthreepi / psdb/public/ 


Table 1. Composition of data sets. 


Set 

Real 

Bogus 

Total 

Training 

4800 

19271 

24071 

Test 

1619 

6405 

8024 

Total 

6419 

25676 

32095 


der of 10 6 examples of bogus detections. We randomly sam¬ 
ple from the available bogus examples and aim for 4 times 
more bogus examples as real, this is similar to the propor¬ 
tions used by Brink et al. (2013). Initial tests with classifiers 
showed that a significant proportion of the false positives ap¬ 
peared to be clean subtractions. We improved the purity of 
the bogus sample by examining the randomly selected bo¬ 
gus detections and added any detections that looked like real 
transient subtractions to the list of real examples (the effect 
of label contamination is further discussed in Section |5.3| ) . 
This produced an extra 464 examples for the set of real de¬ 
tections resulting in a final total of 6419. We then selected 
4 times as many bogus images from the remainder of the 
bogus examples we inspected, producing a sample of 25676 
bogus detections. 

The final training set contains 32095 training examples. 
We divide the training examples into 2 sets, distributed as 
follows; 75% for training and cross validation, and 25% for 
testing. The training examples are randomly shuffled prior 
to splitting with the caveat that all detections on the same 
night of a given object are included in the same set. This 
is to avoid detections with almost identical statistics being 
in multiple sets and giving a false impression of a classifiers 
performance. The construction of the data set is summarised 
in Table [l] The label for each training example is a 1 or 0, 
with 1 representing a label of real and 0 bogus. 

The training set we have constructed is representative, 
containing examples of detections from different chips, see¬ 
ing conditions, and filters, with various levels of S/N and 
examples of all types of processing artefact. 


3.2 Feature Representation 


Machine learning algorithms require a 1-dimensional (1-D) 
vector representation of each training example, where each 
element of the vector corresponds to some numeric data or 
feature that may be useful to the algorithm for discerning 
examples belonging to each class. Previous work in the area 
of real-bogus classification, has focused on using parameters 
contained in catalogues generated by the processing pipeline 
and more complex features derived from that information to 


represent the detections, see Table 1 from Brink et al. (2013) 


and Table 1 from Romano, Aragon & Ding (2006). 


The catalogue features available to individual surveys 
depend on the implementation of their image processing 
pipeline. When applying machine learning for real-bogus 
classification to a new survey it may not be possible to cal¬ 
culate these features based on the information available in 
the catalogues. There is also potential to spend a lot of time 
deriving and testing ways to combine the catalogue infor¬ 
mation that is available into features that we hope capture 
the differences between real and bogus detections. Bogus 
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detections are the result of many factors and establishing 

the same normalisation function used by EyEj 4 

(Bertin 

2001) 

a set of features that can encapsulate them all is difficult. 

and similar to that of Romano, Aragon &; Ding (2006 

)• 


In contrast simply representing the detections by their pixel 
intensity values requires no time spent developing or tuning 
feature extractors. Previous work that relies solely on the 
pixel data has proven effective for simple visual classifica¬ 
tion tasks, such as hand written digits (LeCun et al. 1998). 


For more complex tasks or to boost performance much of 
this work has been performed by learning a hierarchy of un¬ 


supervised features from the pixel data (Coates, Lee, & Ng 


(2011); LeCun et al. (1998)). Establishing a firm benchmark 
on the pixel intensity representation allows us to assess the 
potential gains from applying these more complex methods 
and is the main focus of this paper. Using this representation 
we expect the learning algorithm to identify salient relation¬ 
ships between pixels for the classification task. In the next 
section we discuss our choice of features and continue in the 
following section by describing the preprocessing steps we 
apply before training. 


3.2.1 Feature Vector Construction 


4 OPTIMISATION OF THE CLASSIFICATION 

SYSTEM 

In order to achieve the best performance from the machine 
learning algorithms discussed in the following sections it is 
necessary to optimise the hyperparameters of each. This is 
done by a process known as cross validation which is a brute 
force search of the hyperparameter space, where a model is 
trained with the hyperparameters selected at predetermined 
intervals within the space. The best combination is selected 
by measuring the performance in a held out sample of the 
24071 training examples. 

Below we give a brief introduction to each of the clas¬ 
sifiers. We also point out the free parameters that must be 
selected by cross validation and discuss this process in depth 
in Section [4.4.1| To end this section on optimisation we show 
the performance of each classifier on the out of sample data 
in the test set. 


To represent our training examples, we use the pixel data 
itself. For a given training example, we construct its feature 
vector by selecting a 20x20 pixel area (corresponding to ~5 
times the average seeing of PS1) around the centre of what 
IPP considers a transient, which we refer to as a substamp. 
The 1-D vector is constructed by shifting off each column of 
the substamp and concatenating those columns together to 
produce a 400 element vector of pixel intensity values. 

In Fig. [3] we show visualisations of these feature vectors 
along with the substamp from which they were constructed 
for examples of real detections and for various levels of S/N. 
In Fig. [4] we show detections labelled as bogus with examples 
of different types of artefact. A learning algorithm will learn 
to identify patterns in the feature vectors that are charac¬ 
teristic of examples belonging to the two classes. 

The choice of feature representation is independent of 
the implementation of the rest of the image processing 
pipeline and survey, with the assumption that the pixel level 
data is easily accessible. 


3.2.2 Feature Preprocessing 

Aside from the image processing steps carried out by the 
pipeline, we carry out 2 additional transformations of the 
data. We first replace any ‘NaN’ pixel values with 0s. ‘NaN’ 
pixel values typically arise from masking or floating point 
overflows during image processing. We choose to replace 
these pixel values with 0 so as not to influence the next 
step in the preprocessing phase. As a second step we apply 
a feature normalisation function which allows classifiers to 
focus on relative pixel intensities and limits the effect of ab¬ 
solute brightness on the classifiers. We apply the following 
normalisation: 

f{x) = w\ l09 ( 1+l ^) (1) 

where x is a feature vector and a is the standard deviation 
of the pixel intensity values for that feature vector. This is 


4.1 Artificial Neural Networks 


Artificial Neural Networks (ANN) comprise a number of in¬ 
terconnected nodes arranged into a series of layers. In this 
study we limit ourselves to a 3-layer ANN (consisting of an 
input layer, a hidden layer and an output layer) as those with 
more than one hidden layer need more careful training and 


require more computational power (Hinton, Osindero, & Teh 


2006). For our purposes we train feed-forward ANNs with 


back-propagation and randomly initialised weights, where 
the activation of each node is calculated with the logistic 
(sigmoid) function. 

By limiting many of the choices for the structure of the 
ANNs we remove the need to select these hyperparameters 
during the cross validation phase in Section [4.4.1| which sig¬ 
nificantly reduces the complexity of the space we have to 
search. This economy of computation comes at the cost of 
not testing regions of the parameter space (e.g. other acti¬ 
vation functions) and restricting the representational power 
of the ANNs by requiring a single hidden layer. We are how¬ 
ever left with only 2 hyperparamters to choose namely, the 
number of nodes that make up the hidden layer, S 2 and 
the regularisation parameter, A through which we attempt 


to prevent overfitting. There is some suggestion (Murtagh 
1991 Geva k, Sitte|1992| that the optimal number of nodes 
in the hidden layer ($ 2 ) is 2n + 1, where n is the number of 
input features. In our case n is fixed at 400 input features, 
suggesting we should train ANNs with S 2 = 801 nodes, how¬ 
ever training such large networks is beyond the scope of this 
work and we instead choose to test values of S 2 in the range 
25-200. 

We use our own vectorised implementation of ANNs 
written in pythorQ The code relies on numpjj^Jfor efficient 


4 http://www.astromatic.net/software/eye 

5 https://www.python.org 

6 http://www.numpy.org 
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Figure 3. Visualisation of feature vectors for detections labelled Figure 4. Similar to Fig. [3] but for bogus examples. 

as real. The feature vectors are constructed by shifting off each 

column of the 20x20 pixel substamp on the left and appending 

them together to produce the 400 element 1-D feature vector 

depicted on the right. 
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8 D.E. Wright et al. 


array manipulations and scipjQfor optimisation of the ob- 
jective function. 


4.2 Random Forests 


Random Forests (RFs) aim to classify examples by build¬ 
ing many decision trees from bootstrapped (sampled with 
replacement) versions of the training data (Breim an||2001 ). 
Classifications are then assigned based on the average of the 
ensemble of decision trees. Each individual tree is grown by 
randomly sampling k features from the n input features and 
selecting the feature that best separates real examples from 
bogus as informed by the gini function. We use scikit-learn’^] 
implementation of RFs where we select hyperparameters by 
assigning values to variables n_estimators, max_features 
and min_samples_leaf; the total number of trees in the en¬ 
semble, the number of features considered at each split and 
the minimum number of examples that define a leaf, below 
which no further splitting is allowed. RFs provide the ability 
to estimate the importance of each feature which we use in 
Section 15 T 21 


4.3 Support Vector Machines 

Support Vector Machines (SVMs) 
aim to find the hyperplane in the input feature space that 
optimally classifies training examples for linearly separable 
patterns, while simultaneously maximising the margin, the 
distance between the training examples which lie closest to 
the hyperplane, known as the support vectors. SVMs can 
be extended to non-linear patterns with the inclusion of a 
kernel, where the kernel transforms the original input data 
into a new parameter space. We again use scikit-learn’s im¬ 
plementation of SVMs where we choose the free parameters 
namely, the penalty parameter, C (similar to A for ANNs) 
and the kernel parameter gamma, which controls the local in¬ 
fluence that support vectors have on the decision boundary. 
We only try SVMs with a Radial Basis Function (RBF) ker¬ 
nel, this being the most common choice and again reduces 
the parameter space that must be searched. 


(Cortes & Vapnikj 1995) 


4.4 Model Selection 


For each algorithm discussed above we need a method to 
choose the optimal combination of hyperparameters that 
will achieve the best performance for the classification task. 
In order to compare the relative performance of the differ¬ 
ent models we need some Figure of Merit (FoM). We use 
the FoM of Brink et al. (2013) which captures the essence 


of the problem we are trying to solve. The FoM is defined 
as the minimum Missed Detection Rate (MDR) (False Neg¬ 
ative Rate) that gives a False Positive Rate (FPR) of 1%. 
That is, assuming we are willing to accept that 1% of the 
images deemed real by the classifier and promoted to hu¬ 
man scanners will turn out to be bogus, what fraction of 
the real images would be discarded? With this we can select 
the model that would discard the least real images while 1% 
of images classified as real can be expected to be bogus. 


7 http://docs.scipy.org/doc/scipy/reference/index.html 

8 http://scikit-learn.org/stable/index.html 


4-4-1 Cross Validation 

When calculating the FoM to compare the relative perfor¬ 
mance of models, it is important that the measurement is 
made on data that the model has not inspected during the 
training phase, otherwise we risk measuring the performance 
on data that the model has overfit and report an FoM that 
we cannot expect to achieve on out of sample data. To mit¬ 
igate this effect we split the data we designated for training 
in Section [3.1 1 into 5 subsets or folds with equal numbers of 
training examples. We then train each model on 4 of these 
folds and use the fifth as a validation set to measure the 
performance. The model is then retrained on 4 folds but a 
different fold is held out. In total the model is trained 5 
times with each fold being held out once. We then average 
the results for the 5-folds and choose the model that re¬ 
sults in the best average FoM. A second advantage is that 
for relatively small data sets where the composition of the 
validation set may not be representative of the entire pop¬ 
ulation, by evaluating the performance on each fold in turn 
and then averaging, we achieve a better estimate of the ac¬ 
tual performance on the entire data set. 

In our case all 3 classifiers output a prediction or hy¬ 
pothesis for each example. These hypotheses can be thought 
of as the probability a given example has of belonging to 
the class of real images, taking on values in the range 0- 
1. A classifier predicts detections with hypotheses close to 
1 are highly likely real transients, while those close to 0 
are bogus. In Fig. [5] we plot the distribution of hypothesis 
values for a RF with n_estimators=100, max_features=25 
and min_samples_leaf=1 trained on 4 folds of the training 
set. The distribution plotted shows the hypotheses for the 
held-out fifth fold. To assign a label of real or bogus we 
must define a decision boundary; a hypothesis value above 
which the classifier labels detections as real, otherwise detec¬ 
tions are labelled bogus. If the classifier has learnt a useful 
model it should output detections labelled as bogus with a 
hypothesis below the decision boundary and those labelled 
as real above the decision boundary for the prediction to be 
correct. Bogus detections with predictions above the deci¬ 
sion boundary are False Positives and real detections with 
hypotheses below the decision boundary are Missed Detec¬ 
tions. For our FoM the decision boundary is selected as the 
hypothesis value above which only 1% of the bogus detec¬ 
tions lie (dashed line in Fig. |5|. The FoM is the fraction of 
the detections labelled as real that lie below this choice of 
decision boundary. During 5-fold cross validation a hypoth¬ 
esis distribution is generated by predicting hypotheses for 
the detections in each of the held-out folds. 

In Fig. [6] we show an example of the 5-fold cross 
validation process for an RF with max_features=25 and 
min_samples_leaf=1. In this example we vary the number 
of decision trees, n_estimators and plot a Receiver Opera¬ 
tor Characteristic (ROC) curve for each model. ROC curves 
are produced by varying the decision boundary at which we 
assign a prediction to a label of real or bogus and calculate 
the FPR and MDR that decision boundary produces for the 
validation set. From the example in Fig.[6]we see that select¬ 
ing a value of 100 for n_estimators produces the best FoM 
of ~0.167, this means that an FPR=1% produces a MDR 
of 16.7%. We also include 5% and 10% FPR levels for ref¬ 
erence. We repeated this process for various sizes of hidden 
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Figure 5. Hypothesis distribution produced during one permu¬ 
tation of 5-fold cross validation. The hypotheses shown are for 
images in the held-out fold. Green shows the hypotheses for the 
validation examples labelled as artefacts and red those labelled as 
real. The decision boundary is selected such that the fraction of 
detections labelled as bogus lying above the decision boundary is 
0.01. The False Positive Rate can be visualised as the fraction of 
green bars with a prediction greater than the decision boundary, 
the MDR is the fraction of red bars with predictions less than the 
decision boundary. The first interval has a frequency of 2376, but 
the plot is truncated for clarity. 


layer. We also show an example of measuring the FoM on a 
data set containing a significant proportion of the training 
data, labelled as over fit in Fig. [6] 

By replicating this process for both ANNs and SVMs 
we were able to select the optimal set of hyperparameters 
for each algorithm. In the second column of Table [2] we show 
the optimal hyperparamters selected for each algorithm by 
cross validation. By using the validation sets to select the 
hyperparameters, there is a danger that the hyperparame¬ 
ters will in effect have been fit to these sets. As a result, the 
FoM we measure on the validation sets is not an unbiased 
measurement of the performance we would expect to achieve 
on data not included in the training folds. We deal with this 
in the next section. 


Testing 

Having selected the optimal model for each of the algorithms 
we retrain these models with the entire training set. This al¬ 
lows the models to learn from more examples. To measure 
how well we expect the models selected by cross validation 
in the last section to perform on unseen data we measure 
the FoM on the test set, the 25% of the data we held back 
from both training and validation. This provides an unbi¬ 
ased estimate of the performance. In Table [2] we show the 
FoM measured on the test set. Fig. shows the ROC curve 
for each model in Table [2j We find that the RF is the best 
classifier with a FoM of 0.106. 

Fig. 0> shows a close-up of the measured FoM for the 
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Figure 6. An example of the cross validation process for a RF 
with max_f eatures=25 and min_samples_leaf = 1. 


RF classifier, where the measured FoM is shown along with 
the performance we would expect to achieve if we were to 
allow 5% or 10% of the bogus detections through to human 
scanners. For example, allowing the FPR to slip to 5% in¬ 
creases the completeness to 97.6%. We also plot the hypoth¬ 
esis distribution for the detections in the test set in Fig. [8] 
The FoM shown in Fig. [T]d is the single best classifier we 
find in our analysis. Using this classifier on a data stream of 
nightly observations from PS1, we would expect that 99% of 
the detections promoted to humans would be of real astro- 
physical transients while 10.6% of the real detections would 
be rejected by the classifier. Brink et al. (2013) report a 
MDR of 7.7% for their system. As a next step it is useful to 
investigate the detections for which the classifier produces 
incorrect predictions to see if there are systematic errors that 
the classifier makes or if it is making correct predictions for 
detections that have been labelled incorrectly during the 
construction of the training set. 


5 FURTHER ANALYSIS 

In this section we attempt to get a better sense of how we 
expect the classifier to perform in practice by characterising 
its performance under various conditions. We aim to iden¬ 
tify trends in the kinds of detections for which the classifier 
makes incorrect predictions and investigate the effect that 
providing the classifier with incorrectly labelled training and 
test sets has on the measured FoM. However, we begin this 
section by looking at methods to boost performance by com¬ 
bining classifiers. 


5.1 Combining Classifiers 

As a last step toward boosting performance we investigated 
a selection of methods to combine the RF, SVM and ANN 
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Table 2. Comparison of learning algorithms. 


Classifier 

Model Parameters 

Threshold 

FoM 

Artificial Neural Network 

S2=200, A = 5 

0.547 

0.233 

Support Vector Machine (RBF) 

C=3, gamma=0.01 

0.788 

0.196 

Random Forest 

n_estimators=1000, max_features=25, min_samples_leaf = 1 

0.539 

0.106 


a) 



b) 



Missed Detection Rate (MDR) 


Figure 7. a) Comparison of the best models for various learning 
algorithms applied to the held out test set. b) Detail of ROC 
curve of the best performing classifier, the Random Forest shown 
in a). At a FPR of 1% the FoM shows that in practice we expect 
to operate at a MDR of 10.6%. 


from Table [2] The predictions of the 3 methods are corre¬ 
lated; a candidate highly ranked by the RF is likely to also 
be highly ranked by the other 2 classifiers, but there are still 
detections of real transients that are discarded by only one 
of the classifiers. From Fig.[9]there are 24 detections labelled 



Hypothesis 


Figure 8. Hypothesis distribution for the optimal Random Forest 
classifier applied to the test set. 


Table 3. Results of combining classifiers. 


Method FPR MDR 


Majority Vote 

0.02 

0.06 

Mean Hypotheses 

0.01 

0.12 

Hypotheses as Features 

0.01 

0.12 


as real that only the RF wrongly rejects, it is these examples 
that we hope to recover by combining classifiers. 

We tried only a few of the simplest combination strate¬ 
gies. First we simply classified a detection based on the ma¬ 
jority vote of the 3 classifiers. Second we assigned each de¬ 
tection a hypothesis that was the mean of the hypothesis 
values output by each classifier. This produced a new dis¬ 
tribution of mean hypotheses, where we again selected the 
decision boundary to produce the FoM. Finally we trained 
a SVM using the 3 hypotheses for each detection as the fea¬ 
tures representing that detection. In the end none of these 
methods outperformed the RF classifier, though the perfor¬ 
mance was comparable (see Table [3]). 

This result is unsurprising given that the classifiers are 
highly correlated and there is no guarantee that these meth¬ 


ods will outperform the best individual classifier (Fumera & 
Roli|2005 ). The RF is in itself an ensemble of classifiers (the 


individual decision trees) and may already incorporate much 
of the gain in performance we can expect from these simple 
methods. 
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Figure 9. Venn diagram showing the relationship between the 
Missed Detections for each classifier. There are 1619 positive ex¬ 
amples in the test set. 


5.2 Relative Feature Importance 


Random Forests provide a built-in method to estimate the 
relative importance of each feature to the classification 


(Breiman 2001). By inspecting the ‘depth’ at which each 
feature is used as a decision node we can estimate the rela¬ 
tive importance of that feature, as those features used closer 
to the top of the tree will contribute to the prediction of 
a larger fraction of the training examples. The fraction of 
samples for which we expect a feature to contribute to the 
classification can be used to gauge its relative importance. 

Fig. |10| shows the relative importance of each pixel de¬ 
termined from the training set. The relative importance met¬ 
ric is normalised such that it sums to 1. The most important 
features have the highest values and as would be expected 
are located in the centre of the image. The pixels on the 
edges of the images are thought to be important for identi¬ 
fying many of the bogus examples, where the object is not 
centred in the substamp and often lies at the edge. For ref¬ 
erence if features were equally important they would each 
have a relative importance of 1/400 = 0.0025. 

Fig. [TO] may suggest some redundancy in the features 
bounding the central pixels. It is expected that omitting 
these features would have little effect on the performance of 
our classifier as RFs are thought to be unaffected by the in¬ 
clusion of noise variables in the feature vector (Biau 2010). 
In contrast Brink et al. (2013) find that the MDR for their 


RF classifier improves by ~4% by omitting noisy features 
using a backward feature selection method. The effect of 
feature selection is an interesting area for future work and 
attempts at optimisation. 


5.3 Label Contamination 


We took care to eliminate label contamination in Section |3T) 
by visually checking and manually labelling each training ex¬ 
ample. Nonetheless we expect that there remain some exam¬ 
ples with incorrect labels. In this section we employ similar 
methods to those in Brink et al. (2013) to investigate the 
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Figure 10. The relative importance of each pixel to the classifi¬ 
cation for the Random Forest. The contributions of each feature 
are normalised such that they sum to 1. 


effect that label contamination has on our ability to train 
and test the optimal RF model. 

First we investigate the effect of adding label contam¬ 
ination to the training set. We add contamination by ran¬ 
domly selecting a subset of the detections from the training 
set and flipping their labels. Those labelled as real are now 
labelled as bogus and vice versa. In Fig.[lT]we plot the effect 
of randomly flipping labels in the training set while leaving 
the original labels in the test set untouched. The measured 
MDR appears fairly unaffected up to around 6% contamina¬ 
tion. The approach of Brink et al. (2013) is robust to around 
10 % suggesting our method may be more susceptible to in¬ 
correctly labelled training data. 

Next we flip labels in the test set, while using the orig¬ 
inal training set labels as they are. Given that the RF has 
been trained with correctly labelled data, for the most part 
we expect it to provide the correct labels for the images in 
the test set. However, the flipped labels affect our ability to 
accurately measure the FoM. Although the classifier makes 
sensible predictions, when we compare these predictions to 
the flipped labels the otherwise correct predictions are now 
evaluated as False Positives or Missed Detections. Fig. [TT| 
shows how the FoM is affected as we increase the fraction of 
flipped labels, we see that even at low proportions labelling 
noise in the test set can have a significant effect. 


5.4 Classification as a Function of Signal-to-Noise 


To investigate the classifier performance as a function of 
Signal-to-Noise (S/N), we also follow a similar analysis to 
Brink et al. (2013). We plot the distribution of magnitudes 
for each example in the test set labelled as real in Fig. |12| 
We divide the examples into 11 bins, each spanning 1 mag¬ 
nitude in the range 13 to 24 mag. We then use the classi¬ 
fier to make a prediction for the examples in each bin and 
calculate the fraction of examples classified as bogus which 
we take as an estimate of the classifier performance for ob¬ 
jects at that level of S/N. For objects with magnitudes >20 
there is a ~6% chance of missing real detections. Counter¬ 
intuitively the detection performance deteriorates for higher 
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Figure 11. The effect of randomly flipping labels. As we increase 
the fraction of the images for which we flip the label i.e. as we 
introduce more label contamination, the performance of the clas¬ 
sifier trained on the contaminated training set and measured on 
the untouched test set (blue line) decreases as expected. Introduc¬ 
ing contamination into the test set has a much more pronounced 
effect on the measured performance even at low fractions (green 
line). 


S/N objects. The number of examples of these cases are low 
as typically these objects result in artefacts from saturation 
and subsequent masking or unclean subtractions. However, 
this can also be understood as an effect of our feature rep¬ 
resentation, where we are learning classifications based on 
the relative intensity of pixels across the substamp. The ten¬ 
dency to misclassify such detections could stem from a com¬ 
bination of the large relative intensity differences between 
pixels in these substamps that often characterise artefacts 
and the low numbers of high S/N images of real transients. 
This explanation is further supported by both the ANN and 
SVM, which also misclassify these objects, suggesting the 
issue is with the data and not a consequence of the realisa¬ 
tion of the RF. In the next section we try to identify any 
relationships in the missed detections. 


5.5 Missed Detections 


We inspected the 172 missed detections (see Fig. 13) look¬ 
ing for similarities that may explain why they were rejected. 
We found that these missed detections are associated with 
112 individual transients. Although we took care to limit 
label contamination during the construction of the training 
set, we identified some examples of obvious bogus detections 
mislabelled as real that account for a small fraction (~1%) 
of the missed detections. 

We also find about 29% of the missed detections appear 
to be a result of faint galaxy convolution problems (see Sec¬ 
tion 2.1). These artefacts are difficult to identify by eye and 


as a result have been incorrectly labelled as real detections 
significantly contributing to the label contamination of the 
test set. 

In Section |5.4| we discussed the high MDRs for bright 



Figure 12. Histogram of magnitudes for the test set examples 
labelled as real. We also show the MDR as a function of S/N 
which increases dramatically for sources brighter than magnitude 
20 . 


sources. In Fig. [l4]we plot the hypothesis values for all de¬ 
tections included in the histogram of Fig. 12 (i.e. all test set 


detections that have been visually classified as real) against 
their magnitude reported by IPP. A feature of the plot that 
stands out is the cluster of sources with magnitudes brighter 
than 16 and hypotheses less than 0.2. Magnier et al. (2013) 
report that for the PS1 37r survey, saturation occurs at ~ 13.5 
for gpi, rpi, ipi, ~13.0 for zpi and ~12.0 in ypi. We were 
concerned that these sources could be saturated, however to 
conclusively determine this the individual images that are 
combined to make a nightly stack would need to be exam¬ 
ined. Instead we scaled the magnitudes reported by |Magnier| 


et al. (2013) for PS1 exposures by the exposure times for 


the individual images that make up a nightly stack and set 
a magnitude limit of 16 mag. Objects brighter than this 
limit may have saturated cores in some exposures and can¬ 
not safely be labelled as real. Some of these sources on close 
inspection also show signs of the unclean subtractions we 
highlighted in Section [271] 

The detections brighter than 16 mag in Fig. with 
hypotheses above 0.2 are all associated with a single con¬ 


firmed supernova (SN), SN 2014bc (PSl-14xz)(Smartt et al. 


2014). SN 2014bc is a nearby (7.6 Mpc) Type-IIP located ir 


the bright host galaxy NGC4258 (Messier 106). The tran¬ 
sient lies close to the core of the host and as a consequence 
the host has been poorly subtracted in the same location in 
all the substamps. Detections of this object appear in both 
the training and test set and although we ensured detections 
from the same night must appear in the same set, the slowly 
evolving plateau has resulted in detections with similar S/N 
and the same pattern of poor subtraction appearing in both. 
It is therefore to be suspected that test set detections asso¬ 
ciated with this SN would have been rejected along with the 
other sources brighter than 16 mags had similar detections 
not been included in the training set. This raises the issue of 
potentially missing the brightest transients which are often 
of interest and the cheapest to classify spectroscopically, we 
return to this in Section |6] 

The high MDRs in the magnitude range 16-20 still re- 
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poor host 
subtractions 



faint galaxy 
convolution problems 



Figure 13. The 172 detections labelled as real but classified as bogus by the RF. Detections are grouped according to the discussion in 
Section [53] The hypotheses for the detections are shown as the inset numbers. 


main unexplained. To address this in Fig. [15] we plot the 
number of examples of real transients in each of the magni¬ 
tude bins used in Section |5.4| for both the training and test 
sets. The plot clearly shows the deficit in training examples 
at magnitudes brighter than 20 and lead us to conclude that 
we lack enough training examples of high S/N transients to 
allow the classifier to learn a model that generalises well in 
this regime. In Fig.[l5]we overlay the relative size of the test 
set compared with the training set in each bin. We selected 
the test set by randomly sampling 25% of the data available 
for training. The small fractions of test examples available 
between 16 and 18 mags combined with the low numbers in 
the range 16-20 mags severely impact our ability to accu¬ 
rately measure the MDR in this range. 

Aside from the issues associated with high S/N, there 
are a few other SNe with detections that show similar host 
galaxy subtraction problems to SN 2014bc. Some of these are 
true bogus detections which we show in Fig. [13] Approxi¬ 
mately 9% of the missed detections are bogus detections 
around poor host subtractions. We include detections of SN 
2014bc with this group in Fig[l3] though these detections 
around 15th magnitude could equally have been included 
with the bright sources. 

Among the missed detections we also found substamps 
where entire rows or columns along an edge of a substamp 
had been masked. In the second panel of Fig. [3] we show 
an example where the bottom 2 rows of pixels have been 
masked. These are examples of the sky cell duplicates we de¬ 
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Figure 14. Plot showing the hypotheses of the real examples in 
the test set against the magnitude measured by IPP. The shaded 
region shows the magnitude cut above which we cannot be certain 
the nightly stacks do not contain saturated images. 


scribe in Section 12.11 We were concerned the classifier was 
rejecting these detections based on the masking. To see if 
this was the case we identified all the examples of sky cell 
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overlap among the real test set detections, and found 20. As 
we ensured that detections from the same night must be in 
the same set (training or test set), the equivalent full 20x20 
pixel substamps were also in the test set. We compared the 
performance on the full pixel substamp with that of the par¬ 
tially masked substamp and found that there is only 1 case 
where the masked substamp was rejected while the full pixel 
substamp was kept. In this instance a significant proportion 
of the substamp was masked (7 columns) with the edge lying 
close to the PSF. The majority of the remaining substamp 
pairs were both assigned the same classification. There are 
however 6 pairs where the masked substamp was correctly 
classified as real, but the full pixel substamp was rejected, 
showing the classifier does not tend to reject detections with 
sky cell masking simply due to the masked regions. 

The reason for rejecting one detection from the pair 
over the other is unclear as both substamps are constructed 
from the same data. The 6 pairs for which the full pixel 
substamp was labelled bogus, but the masked substamp 
was labelled real are all associated with a single transient 
and may not apply to other sky cell pairs. For these sub¬ 
stamp pairs we found that the centroids always differed by 
1 pixel and were offset in the same direction. We tried shift¬ 
ing the centre of the stamps to the same pixel, but found 
that this had little impact on the hypothesis. In all cases the 
flux-conserving warping results in equivalent pixels contain¬ 
ing different counts, though the difference is typically small 
<10%. Given the small number of cases where the detec¬ 
tions of a sky cell pair are assigned to different classes (7 
in total) and that these detections are associated with only 
two transients (6 associated with a single transient where 
the full pixel substamp is rejected and 1 associated with a 
different transient where the masked substamp is rejected), 
it is difficult to explain this behaviour, though one explana¬ 
tion may be the small differences in pixel intensity values 
perhaps combined with the different centroids. 

The nuances of difference imaging make it difficult to 
determine the ground truth label for each detection. Hu¬ 
mans often require additional information beyond that con¬ 
tained in the single difference image e.g. position relative to 
the host, or the number of bad/good pixels visible in the 
input image. The investigations above suggest that the clas¬ 
sifier is identifying subtle relationships and correctly iden¬ 
tifying that many of the ‘missed detections’ are dubiously 
labelled as real. We estimate that 45% (~5% bright sources; 
~29% convolution problems; ~9% poor host subtractions; 
~1% obvious mislabelled artefacts) or about 77 of the missed 
detections are not of high enough quality to be confidently 
labelled as real detections. Therefore the RF classifier is not 
strictly getting them wrong. The high proportions of these 
cases among the missed detections does not hold true for 
the entire sample of real detections in the test set, where 
for example faint galaxy convolution problems are crudely 
estimated to account for no more than 7%. Removing such 
detections from our test set results in an MDR around 6.2%. 
The MDR of our classifier is therefore in the range 6.2 - 
10.6% for a FPR of 1% but most likely toward the lower end 
of this range. The remaining 95 detections are true missed 
detections and appear to be mislabelled by the classifier due 
to high S/N as discussed above, poor seeing conditions and 
very low S/N detections near the detection limit. 



Figure 15. The magnitude distributions of the real examples in 
the training and test sets. The numbers on each bin show the total 
number of images in the training set. We also show the relative 
numbers of test set examples in each bin (blue line). The shaded 
region again shows the magnitude cut defined in Section |5.5| 


5.6 Medium Deep Confirmed Supernovae 


In order to demonstrate how we might expect the classifier 
to perform on a live data stream we first use the classifier 
to make predictions for the supernova PSl-13avb for which 
we held out all associated detections from both the test and 
training set. This object has been spectroscopically classi¬ 
fied as a Type lb SN and has a well sampled lightcurve from 
about -18 days pre-maximum to around 106 days post max¬ 
imum, including exposures in all 5 filters ranging in mag¬ 
nitude from around 23 to 20 mags (see Fig. 


16 


top panel). 

We selected this object for its high quality lightcurve and 
magnitude range which represents the majority of objects 
discovered in the PS1 MDS. In the bottom panel of Fig. [TB] 
we show the hypothesis for each epoch of this target. The 
plot shows that the hypothesis is consistently above the de¬ 
cision boundary of 0.539 (selected in Section 4.4.2) with the 
exception of the detection from 56480.406 MJD (Modified 
Julian Date) which shows the transient at a magnitude of 
gPi = 23.12T0.21 approaching the detection limit in this 
filter. The detection is displayed as an inset in Fig. |16| with 
its hypothesis of 0.506, showing the low S/N and deviation 
from a PSF-like morphology. 


5.7 Early Detection 


One of the major aims of recent supernova searches has been 
to try to detect the transient as soon after explosion as pos¬ 
sible in order to trigger rapid follow-up to spectroscopically 
study regions of the transients evolution that remain rela¬ 
tively unexplored (Gal-Yam et al. (2014); Cao et al. (2013)). 
To this end we carry out a simple test by using the classifier 
to make predictions for the first detections of all 53 clas¬ 
sified SNe in our database. Again we held these detections 
out from the training and test sets. In Table [4] we list the 
53 SNe and the details of the first detections along with the 
hypothesis for each detection. The classifier correctly pre¬ 
dicts all detections as real and had it been running on a live 
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Figure 16. (top panel) PS1 lightcurve of the Type lb Supernova 
PSl-13avb. (bottom panel) The hypothesis for each epoch. The 
dashed line shows the decision boundary (0.539) below which the 
classifier predicts an image as bogus, (inset) The only missed 
detection for this SN which shows low S/N. 


data stream would have promoted all objects to humans for 
follow-up. 


6 SUMMARY OF RESULTS AND 
CONCLUSIONS 

In this work we have constructed a data set of detections 
from the Pan-STARRSl Medium Deep Survey. We used this 
data set to train a Random Forest classifier to reject bogus 
detections of transients before they are presented to humans 
as potential targets for follow-up. As the feature represen¬ 
tation of these detections we used the pixel intensity values 
of a 20x20 pixel substamp centred on the detection. This 
choice is independent of the observing strategy and removes 
the need for careful feature design and selection that requires 
specific domain knowledge. The choice of features also make 
this method applicable to any survey performing difference 
imaging and requires no information from either the tem¬ 
plate image or nightly stack. Using the Figure of Merit as 
defined in Brink et al. (2013) we selected the decision bound¬ 


ary such that objects classified as real should be 99% pure, 
which resulted in a best estimate of a Missed Detection Rate 
of 6.2% (i.e. 93.8% complete) and can compete with previous 
work in this area. We further tested the classifier by apply¬ 
ing it to the lightcurve of a Type-lb supernova and found 
only one missed detection out of 74. The missed detection 
had low signal-to-noise. In addition, to assess the classifiers 


performance for early detection, we used the classifier to 
make predictions for the first detections of 53 spectroscopi¬ 
cally confirmed SNe in our database and found none would 
have been rejected. 

We discovered our classifier struggles to provide accu¬ 
rate classifications for the brightest sources (<19 mags). 
Many of these are associated with bright variable stars and 
have ringing patterns due to the kernel size definition, which 
leads to labelling difficulties. Some are also close to the sat¬ 
uration limit which may cause the algorithms to misiden- 
tify real sources as bogus. The mathematical problem in 
detecting bright variable stars in difference images is clearly 
quite distinct from finding low flux and moderate flux level 
transients in, or near extended galaxies. Furthermore the 
scientific goal in characterising variability of stellar sources 
is typically based on total flux measurements whereas find¬ 
ing explosive transients requires the resolved and unresolved 
galaxies to be subtracted. Our methods are tailored to¬ 
ward the latter, and can certainly not be blindly applied to 
uncover complete populations of variable stars or variable 
AGNs. With a goal of discovering extragalactic transients, 
one is content to ignore stellar variables in a data stream, 
although we show here that the algorithms can sometimes 
misclassify bright and high signal-to-noise explosive tran¬ 
sients. 

We also found the MDR is consistently higher for 
sources brighter than 20 mags which we attribute to the lack 
of training data in this range. We would expect that provid¬ 
ing more training examples that are representative of these 
objects would reduce the MDR for brighter sources. In this 
paper we have only used a sample of the data from the PS1 
MDS, but we have access to the full database of MDS tran¬ 
sients, which could be used to provide more training data. 
In addition we also have data from PS1 3ty difference imag¬ 
ing which could also be used to boost training numbers and 
build a classifier that could perform real-bogus classification 
for both surveys. In our analysis we have not considered the 
case of asteroids as these are typically removed during the 
construction of the nightly stacks in the MDS. Including 
the PS1 37r data, where differencing is performed on indi¬ 
vidual exposures, would allow us to test the performance of 
our method on asteroids. It may also be more beneficial to 
apply this approach at the source extraction stage. By work¬ 
ing directly on the pixel data the classifier could potentially 
learn which sources to extract and which to discard from a 
difference image before any further processing of a potential 
detection is performed. 

The dependence of any machine learning approach to 
real-bogus classification on large amounts of training data 
presents a serious problem for any new survey. While many 
sources of processing artefacts are common across surveys, 
differing pixel scales and seeing conditions prevent the use 
of a classifier trained on one survey being directly applied to 
another. A solution would be to build a training set based on 
hand labelled commissioning data and periodically retrain 
the classifier as new data become available. Alternatively an 
initial classifier trained on the limited data available early in 
a survey could be improved on by employing online learning, 
where the classifier is automatically updated as new labelled 
data are gathered ( Shalev-Shwartz|201 1 Saffari et al.|2009 ). 

Future work will focus on combining the remaining PS1 
data available into a single training set that will hopefully 
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address the S/N issue. Other areas of research could include 


the use of semi-supervised feature learning ( Raina et al. 
2007) and deep learning ( Coates et al.||2013 ) that retain all 


the advantages of our current approach at the expense of be¬ 
ing more computationally demanding. However, the added 
representational power of larger ANNs and the possibility of 
applying the unsupervised features learnt from one survey 
to a variety of other surveys could mean this is a promising 
domain to explore. 

An efficient real-bogus classifier is only one step toward 
rapid discovery and classification of transients. With next 
generation surveys the stream of transients will need to be 
prioritised based on scientific goals. Providing a contextual 
classification ( Djorgovski et ah||2012 Bloom et al.||2012 ) of 
the transients detected would allow researchers to select the 
most promising candidates for their research goals and will 
also be the focus of future work. 
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Table 4. First detections of the 53 spectroscopically confirmed PS1 supernovae ordered by hypothesis. (Due 
to sky cells some SNe appear twice.) 


Name 

Classification 

First Detection (MJD) 

Magnitude 

Filter 

Hypothesis 

PSl-13duq 

la 

56588.381 

21.64 

ipi 

0.989 

PSl-13bzp 

la 

56478.276 

21.33 

gPi 

0.98 

PSl-13bzp 

la 

56478.276 

21.30 

gPi 

0.975 

PSl-13abg 

II-P 

56383.336 

21.47 

Zpi 

0.971 

PSl-13bqg 

la 

56443.287 

20.90 

gPi 

0.968 

PSl-13abg 

II-P 

56383.336 

21.55 

Zpi 

0.963 

PSl-14il 

Iln 

56676.554 

21.36 

Zpi 

0.959 

PSl-13vc 

la 

56351.476 

20.67 

Zpi 

0.958 

PSl-13abw 

Ic 

56383.438 

21.23 

Zpi 

0.958 

PSl-14ky 

II 

56681.499 

21.60 

Zpi 

0.957 

PSl-13ur 

la 

56351.525 

20.32 

Zpi 

0.955 

PSl-13eae 

II 

56604.598 

19.82 

ypi 

0.954 

PSl-13alz 

II-P 

56399.282 

20.43 

ipi 

0.952 

PSl-12cnr 

la 

56283.340 

20.33 

Zpi 

0.948 

PSl-13can 

la 

56477.567 

22.04 

Zpi 

0.946 

PSl-13cws 

la 

56549.460 

21.48 

Zpi 

0.943 

PSl-13ge 

la 

56328.612 

21.49 

gPi 

0.94 

PSl-12cho 

la 

56262.469 

21.07 

Zpi 

0.933 

PSl-12cey 

II 

56268.294 

22.06 

gPi 

0.931 

PSl-13bok 

I 

56424.560 

22.35 

rpi 

0.926 

PSl-13djz 

Ic 

56554.585 

20.74 

Zpi 

0.923 

PSl-13a 

la 

56289.280 

21.20 

Zpi 

0.919 

PSl-13bit 

la 

56420.548 

22.69 

ipi 

0.918 

PSl-13bqb 

la 

56443.287 

22.32 

gPi 

0.918 

PSl-13djj 

la 

56563.576 

20.72 

gPi 

0.916 

PSl-12bza 

II-P 

56262.469 

21.12 

Zpi 

0.914 

PSl-13brf 

la 

56443.324 

22.68 

rpi 

0.91 

PSl-13hp 

II-P 

56325.545 

21.40 

gPi 

0.907 

PSl-13adg 

la 

56384.515 

21.71 

rpi 

0.902 

PSl-13awf 

I 

56417.315 

22.47 

ipi 

0.899 

PSl-13atm 

II-P 

56410.298 

22.25 

Zpi 

0.898 

PSl-13cjb 

II 

56501.436 

22.63 

gPi 

0.896 

PSl-12cho 

la 

56262.469 

21.08 

Zpi 

0.884 

PSl-13cai 

la 

56477.567 

21.99 

Zpi 

0.882 

PSl-13bni 

II-P 

56420.548 

23.07 

ipi 

0.873 

PSl-13bog 

la 

56417.341 

22.57 

ipi 

0.867 

PSl-12chw 

la 

56262.313 

21.21 

ypi 

0.857 

PSl-13bqv 

la 

56442.486 

21.64 

Zpi 

0.849 

PSl-13djs 

la 

56562.587 

21.44 

Zpi 

0.84 

PSl-13aai/SN 2013au 

la 

56370.421 

19.29 

Zpi 

0.833 

PSl-13cuc/SN 2013go 

la 

56536.587 

19.03 

Zpi 

0.814 

PSl-13hs 

I 

56328.515 

21.97 

gPi 

0.801 

PSl-13baf 

II-P 

56414.521 

22.54 

ipi 

0.801 

PSl-13avb 

lb 

56414.521 

21.75 

ipi 

0.796 

PSl-13ayn 

la 

56416.449 

22.01 

rpi 

0.77 

PSl-13aai/SN 2013au 

la 

56370.421 

19.36 

Zpi 

0.735 

PSl-13fo/SN 2013X 

la 

56314.625 

18.04 

ypi 

0.713 

PSl-13bus 

la 

56462.440 

22.91 

ipi 

0.708 

PSl-13brw 

II-P 

56436.325 

20.71 

ypi 

0.707 

PSl-13hi 

Iln 

56324.604 

18.50 

Zpi 

0.704 

PSl-13bvc 

la 

56469.349 

21.90 

Zpi 

0.698 

PSl-13bzk 

la 

56468.571 

21.38 

ypi 

0.662 

PSl-13abf 

la 

56380.368 

20.21 

ypi 

0.642 

PSl-13arv 

la 

56409.239 

20.98 

ypi 

0.638 

PSl-13wr 

II-P 

56349.602 

20.39 

ypi 

0.637 

PSl-14xz/SN 2014bc 

II-P 

56399.380 

18.25 

ipi 

0.623 

PSl-13wr 

II-P 

56349.602 

20.40 

ypi 

0.61 

PSl-13buf 

la 

56461.299 

22.60 

Zpi 

0.577 
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